RenderFlow
Single-Step Neural Rendering via Flow Matching
Zhang et al.
Presented by Manish Mathai
March 5, 2026
Scene
Geometry
Materials
Lights
- Geometry: surfaces made of millions of tiny triangles
- Materials: color, roughness, metallic – how surfaces look
- Lights: environment maps, point lights, the sun
Path Tracing
- The gold standard for realism
- Trace rays of light as they bounce around the scene
- Each bounce samples a random direction, so thousands of rays are needed to converge
- Problem: very expensive
- Convergence can take minutes to hours per image
- Need many samples (light rays) per pixel to reduce noise
Samples vs. Quality
1 SPP (top-left) to 32,768 SPP (bottom-right), doubling left to right.
What if we could skip the expensive sampling entirely?
Rasterization
- Project triangles to pixels instead of tracing rays
- First pass: Store scene properties as images called G-buffers:
- Albedo (base color), Normals, Depth, Roughness, Metallic
- Second pass: Combines these for fast shading
- No global effects: soft shadows, reflections, and indirect light are faked with cheaper, rough approximations
- This pipeline is called deferred rendering
The Gap
Path Tracing
- Physically accurate
- Minutes to hours per frame
- The “ground truth”
Deferred Rendering
- Real-time (milliseconds)
- Misses global illumination
- G-buffers are cheap to produce
Path Tracing vs Rasterization
The Quality Gap: Interiors
Rasterization (left) vs path tracing (right)
The Quality Gap: Subtle Details
Best of both worlds?
Can we get path-tracing quality from G-buffers… using a neural network?
Prior Work: Diffusion Models
- Models like Stable Diffusion, DALL-E learn to generate images from noise
- Forward process: gradually add noise to an image until it’s pure static
- Reverse process: a neural network learns to undo the noise, step by step
- Typically needs 20-50 denoising steps to produce a clean image
- What if we condition the reverse process on G-buffers to guide it toward a rendered image?
RGB-X (SIGGRAPH 2024)
- Condition a diffusion model on G-buffers to synthesize realistic images
- Estimates intrinsic channels: the surface properties like albedo, normals, etc.
- Also works in reverse: RGB image -> G-buffer decomposition
- ~50 denoising steps, ~2.2 seconds per frame
DiffusionRenderer (CVPR 2025)
- Extends the idea to video using a video diffusion model
- Handles temporal consistency across frames
- Trained on synthetic + auto-labeled real-world data
- Enables relighting, material editing, and object insertion from a single video
- ~30 denoising steps, ~1.4 seconds per frame
But Two Problems…
- Slow: 20-50 denoising steps per frame
- RGB-X: ~2.2 seconds per frame
- DiffusionRenderer: ~1.4 seconds per frame
- Not real-time (need < 33ms for 30fps)
- Stochastic: different random seeds produce different results
- Flickering between frames: shadows appearing and disappearing, lighting and reflections shifting
- Not reproducible, making it bad for production pipelines
Flow Matching
- Alternative to diffusion: learn a velocity field that transports samples from source to target distribution
- Deterministic: follows a deterministic ODE instead of a stochastic process
- The same input gives the same output every time
- Rectified flow: encourages straight-line trajectories between paired samples
- Straight paths incur zero discretization error – the ODE can be solved in a single Euler step
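A toy numpy sketch of why straight trajectories admit single-step solving (the paired samples and shapes here are made up for illustration; this is not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def velocity_target(z0, z1):
    # Rectified flow: along the straight path z_t = (1-t) z0 + t z1,
    # the target velocity is constant: dz/dt = z1 - z0.
    return z1 - z0

# Toy paired samples: source (e.g. albedo latent) and target (rendered latent).
z0 = rng.normal(size=(4, 8))
z1 = z0 + 2.0                  # hypothetical paired target

# A perfectly trained model would predict v = z1 - z0 everywhere on the path.
v = velocity_target(z0, z1)

# One Euler step from t=0 to t=1 lands exactly on the target,
# because a straight line incurs zero discretization error.
z_pred = z0 + 1.0 * v
assert np.allclose(z_pred, z1)
```

With curved (diffusion-like) trajectories the same one-step Euler update would overshoot, which is why those models need many steps.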
RenderFlow: Key Idea
- Learn a single-step flow: G-buffers -> rendered image
- Key insight: replace noise with albedo as the starting point
- Already spatially aligned with the target. It has the right colors, textures, structure
- Network only learns the residual: shadows, reflections, global illumination
- Much smaller “distance” than noise -> image. A single step suffices
- Single forward pass: ~0.19s per frame (10x faster than diffusion methods)
How to Train It: Bridge Matching
- Pure flow matching trains on exact straight-line paths. That can be brittle
- Bridge matching: add small noise perturbations during training only
- \(z_t = (1-t)z_0 + tz_1 + \sigma\sqrt{t(1-t)}\epsilon\)
- \(z_0\) = albedo, \(z_1\) = rendered image, \(\sigma\) = noise scale, \(\epsilon\) = random noise
- Acts as a regularizer and the model sees diverse variations of the path
- When \(\sigma = 0\), reduces to pure flow matching
- Result: more robust to variations in lighting and materials
- Inference remains deterministic – noise is a training trick only
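The bridge-matching interpolant above can be sampled directly from its formula; a minimal sketch (toy latents, shapes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)

def bridge_sample(z0, z1, t, sigma, rng):
    """z_t = (1-t) z0 + t z1 + sigma * sqrt(t(1-t)) * eps."""
    eps = rng.normal(size=z0.shape)
    return (1 - t) * z0 + t * z1 + sigma * np.sqrt(t * (1 - t)) * eps

z0 = rng.normal(size=(4, 8))   # albedo latent (toy)
z1 = rng.normal(size=(4, 8))   # rendered-image latent (toy)

# The perturbation vanishes at both endpoints: sqrt(t(1-t)) = 0 at t=0 and t=1.
assert np.allclose(bridge_sample(z0, z1, 0.0, 0.1, rng), z0)
assert np.allclose(bridge_sample(z0, z1, 1.0, 0.1, rng), z1)

# With sigma = 0 the bridge reduces to the pure flow-matching straight line.
zt = bridge_sample(z0, z1, 0.5, 0.0, rng)
assert np.allclose(zt, 0.5 * z0 + 0.5 * z1)
```

Since the noise only perturbs intermediate training states, inference at t=0 with a learned velocity stays deterministic.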
Train Multi-Step, Infer Single-Step
- Train with bridge matching at 4 discrete timesteps [1.0, 0.75, 0.5, 0.25], \(\sigma = 0.005\)
- But infer in 1 step. Just one forward pass
- Why does this work?
- Multi-step training exposes the model to intermediate states
- Single-step inference avoids error accumulation across steps
| Training | Inference | PSNR ↑ |
|---|---|---|
| 4-step ODE | 4 steps | 23.09 |
| 4-step ODE | 1 step | 23.30 |
| 4-step SDE | 4 steps | 23.38 |
| 4-step SDE | 1 step | 23.59 |
1-step inference outperforms multi-step because fewer steps mean less error accumulation
PSNR (Peak Signal-to-Noise Ratio): higher = better. Ablation at 256x256.
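For reference, PSNR is a simple function of mean squared error; a minimal sketch (the uniform-error test image is made up):

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak Signal-to-Noise Ratio in dB; higher means closer to the reference."""
    mse = np.mean((pred - target) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

target = np.full((16, 16), 0.5)
pred = target + 0.01                  # uniform error of 0.01 -> MSE = 1e-4
print(round(psnr(pred, target), 1))   # 40.0
```

A ~1 dB PSNR gap, as between the ODE and SDE rows above, corresponds to a noticeable reduction in average error.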
Architecture
- Repurposes a pretrained video diffusion model (Wan2.1, 1.3B parameter DiT)
- Albedo replaces noise input; text cross-attention removed
- All inputs (G-buffers + environment map) encoded by VAE into latent space
- G-buffer tokens added element-wise to albedo tokens (spatially aligned: same pixel locations)
- Envmap Adapter: environment map injected via adaptive normalization (scale + shift) because not spatially aligned like G-buffers
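A rough sketch of scale-and-shift conditioning for a non-spatial input like an environment map (the projection matrices `W_scale`/`W_shift` and all shapes are hypothetical, not the paper's actual adapter):

```python
import numpy as np

def adaptive_norm(tokens, env_embedding, W_scale, W_shift, eps=1e-5):
    """Inject a non-spatial condition via per-channel scale and shift.

    tokens:        (n_tokens, d)  spatially aligned features
    env_embedding: (d_env,)       pooled environment-map embedding
    """
    # Normalize token features (layer-norm style, no learned affine here).
    mu = tokens.mean(axis=-1, keepdims=True)
    var = tokens.var(axis=-1, keepdims=True)
    normed = (tokens - mu) / np.sqrt(var + eps)
    # Scale and shift are predicted from the environment embedding.
    scale = env_embedding @ W_scale   # (d,)
    shift = env_embedding @ W_shift   # (d,)
    return normed * (1 + scale) + shift

rng = np.random.default_rng(0)
tokens = rng.normal(size=(10, 8))
W_scale = rng.normal(size=(4, 8))
W_shift = rng.normal(size=(4, 8))
out = adaptive_norm(tokens, np.zeros(4), W_scale, W_shift)
# With a zero embedding, the layer reduces to plain normalization.
```

This is why adaptive normalization suits the environment map: it modulates every token globally instead of assuming pixel alignment.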
Training Losses
- Latent loss: bridge matching loss in VAE latent space (the core objective)
- Pixel losses applied after decoding back to image space:
- LPIPS: perceptual similarity (captures structural differences humans notice)
- Gradient loss: preserves high-frequency details like contact shadows
- Total: \(\mathcal{L}_{total} = \mathcal{L}_{latent} + \lambda \mathcal{L}_{pixel}\)
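The combined objective can be sketched as follows; `lpips_fn` stands in for a real LPIPS network (which needs pretrained weights), and the weight `lam` is a placeholder, not the paper's value:

```python
import numpy as np

def gradient_loss(pred, target):
    """L1 difference of finite-difference image gradients (preserves edges)."""
    dx = lambda img: img[:, 1:] - img[:, :-1]
    dy = lambda img: img[1:, :] - img[:-1, :]
    return (np.abs(dx(pred) - dx(target)).mean()
            + np.abs(dy(pred) - dy(target)).mean())

def total_loss(latent_pred, latent_target, img_pred, img_target,
               lpips_fn, lam=0.1):
    """L_total = L_latent + lambda * L_pixel."""
    l_latent = np.mean((latent_pred - latent_target) ** 2)  # bridge-matching MSE
    l_pixel = lpips_fn(img_pred, img_target) + gradient_loss(img_pred, img_target)
    return l_latent + lam * l_pixel

img = np.full((8, 8), 0.5)
lpips_stub = lambda a, b: 0.0   # stand-in; real LPIPS uses a pretrained net
loss = total_loss(img, img, img, img, lpips_stub)
```

The pixel terms only exist because predictions are decoded back to image space; the latent term alone cannot see VAE decoding artifacts.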
Video Inference
- Model trained on short clips (5 frames) for memory efficiency
- Long videos rendered in overlapping chunks:
- Last frame of chunk N becomes the conditioning frame for chunk N+1
- Promotes smooth transitions and temporal coherence
- Combined with keyframe guidance for best results
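The chunked rollout above might look like this in outline (`render_chunk` is a hypothetical stand-in for the model call; chunk length and conditioning protocol are simplified):

```python
def render_video(g_buffers, render_chunk, chunk_len=5):
    """Render a long sequence chunk by chunk.

    g_buffers:    list of per-frame G-buffer inputs (length N)
    render_chunk: model call (chunk_of_gbuffers, cond_frame) -> list of frames
    """
    frames, cond = [], None
    i = 0
    while i < len(g_buffers):
        chunk = g_buffers[i:i + chunk_len]
        out = render_chunk(chunk, cond)   # condition on previous chunk's end
        frames.extend(out)
        cond = out[-1]                    # last frame seeds the next chunk
        i += chunk_len
    return frames

# Toy check with an identity "model" that just echoes its inputs.
identity_model = lambda chunk, cond: list(chunk)
frames = render_video(list(range(12)), identity_model)
```

Carrying the last rendered frame forward is what lets a model trained on 5-frame clips stay temporally coherent over long videos.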
Keyframe Guidance
- G-buffers alone lack global lighting info (shadows, reflections are ambiguous)
- Solution: feed sparse path-traced keyframes as additional guidance
- e.g., one high-quality reference frame every 16 frames
- Keyframe Adapter: cross-attention branch injected into each transformer block
- Uses RoPE to encode temporal distance between keyframe and current frame
- Two-stage training: train base model first, freeze it, then train only the adapter
- Base performance unchanged when no keyframes are provided
Keyframe Guidance: Impact
| Keyframe interval | PSNR ↑ |
|---|---|
| No keyframes | 24.02 |
| Every 49 frames | 25.92 |
| Every 25 frames | 26.57 |
| Every 13 frames | 29.72 |
- Even sparse keyframes (every 49 frames) significantly outperform no guidance
- More keyframes = better quality, as expected
- Negligible speed impact (~0.24s vs ~0.19s per frame)
Ablation at 256x256 (Supplementary Table S1); full-resolution results in Table 1
Inverse Rendering
- Can we run the model backwards? RGB image -> G-buffers?
- Freeze the entire forward model, add lightweight adapters:
- LoRA on self-attention (same pattern as LLM fine-tuning)
- Cross-attention conditioned on a text prompt (“albedo”, “normal”, etc.)
- Per-intrinsic MLP heads for each output type
- One unified model handles both forward and inverse rendering
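The LoRA pattern mentioned above, in miniature (ranks, init scales, and shapes here are illustrative, not the paper's configuration):

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update B @ A (rank r << d)."""
    def __init__(self, W, r=4, alpha=1.0, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        d_out, d_in = W.shape
        self.W = W                                    # frozen backbone weight
        self.A = rng.normal(0, 0.01, size=(r, d_in))  # trainable
        self.B = np.zeros((d_out, r))                 # trainable, zero-init
        self.alpha = alpha

    def __call__(self, x):
        # B starts at zero, so the adapted layer initially matches the frozen one.
        return x @ (self.W + self.alpha * self.B @ self.A).T

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 5))
x = rng.normal(size=(2, 5))
layer = LoRALinear(W)
out = layer(x)
```

Because only `A` and `B` (and the small per-intrinsic heads) train, the forward renderer's weights are untouched and both directions share one backbone.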
Results: Quantitative
Traditional baseline (not a neural method):
| Method | Type | Params | PSNR ↑ | LPIPS ↓ | Time (s/frame) |
|---|---|---|---|---|---|
| Deferred | Traditional | – | 24.65 | 0.097 | real-time |
Neural rendering methods:
| Method | Type | Params | PSNR ↑ | LPIPS ↓ | Time (s/frame) |
|---|---|---|---|---|---|
| RGB-X | Diffusion | 950M | 20.98 | 0.165 | ~2.19 |
| DiffusionRenderer | Diffusion | 1.7B | 23.76 | 0.128 | ~1.40 |
| Ours (w/o key) | Flow | 1.4B | 24.21 | 0.113 | ~0.19 |
| Ours (w/ key) | Flow | 1.7B | 26.66 | 0.101 | ~0.24 |
LPIPS (Learned Perceptual Image Patch Similarity): lower = better perceptual quality
- 10x faster than RGB-X, 7x faster than DiffusionRenderer
- Outperforms both neural baselines on all metrics, even without keyframes
- With keyframes: surpasses even traditional deferred rendering
Results: Deterministic
- RenderFlow: zero variance across runs (deterministic)
- Diffusion baselines: significant variance (stochastic)
- Same input always produces the exact same output
- Critical for production: no flickering, reproducible results
Results: Visual Comparison
Dataset
- No existing large-scale rendering dataset with G-buffers + environment maps
- Built a custom dataset using Unreal Engine 5 Movie Render Queue:
- Artist-crafted: 30,000 frames from professional scenes
- Procedural: 100,000 frames from randomly composed scenes
- 4,000 unique meshes, 30 HDR environment maps
- Randomized material attributes for diversity
- All rendered at 512x512, 256 SPP, denoised with Intel Open Image Denoise
- Both baselines (RGB-X, DiffusionRenderer) fine-tuned on this same dataset
Limitations
- VAE bottleneck: encoder/decoder accounts for ~90% of inference time
- The transformer itself is fast; the VAE is the constraint
- Dataset diversity: synthetic scenes only, limited lighting phenomena and geometric complexity
- Fails on highly complex geometries (fine-grained details lost in VAE compression)
- Temporal blurring: causal VAE convolution causes later frames to blur
- Initial frame stays sharp; subsequent frames progressively soften
- Resolution: trained and evaluated at 512x512 only
Discussion
- Key takeaway: flow matching + albedo starting point = single-step rendering
- Three contributions:
- Single-step flow-based rendering (10x faster, deterministic)
- Keyframe guidance adapter (significant quality boost)
- Inverse rendering via frozen backbone + adapters
- Open questions:
- Can the VAE bottleneck be eliminated?
- How does this scale to 1080p or 4K?
- Could this work with real-world captured scenes (not just Unreal Engine 5)?
- What about dynamic lighting changes within a sequence?
Thank You!
Questions and discussion